Statistical Practices of Educational Researchers: An Analysis of Their ANOVA, MANOVA, and ANCOVA Analyses

Authors

  • H. J. Keselman
  • Carl J. Huberty
  • Lisa M. Lix
  • Stephen Olejnik
  • Robert A. Cribbie
  • Barbara Donahue
  • Rhonda K. Kowalchuk
  • Laureen L. Lowman
  • Martha D. Petoskey
  • Joanne C. Keselman
  • Joel R. Levin
Abstract

Articles published in several prominent educational journals were examined to investigate the use of data-analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, we also catalogued whether: (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected based on power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. Our analyses imply that researchers rarely verify that validity assumptions are satisfied and accordingly typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. We offer many recommendations to rectify these shortcomings.

It is well known that the volume of published educational research is increasing at a very rapid pace. As a consequence of the expansion of the field, qualitative and quantitative reviews of the literature are becoming more common. These reviews typically focus on summarizing the results of research in particular areas of scientific inquiry (e.g., academic achievement or English as a second language) as a means of highlighting important findings and identifying gaps in the literature. Less common, but equally important, are reviews that focus on the research process, that is, the methods by which a research topic is addressed, including research design and statistical analysis issues.

Methodological research reviews have a long history (e.g., Edgington, 1964; Elmore & Woehlke, 1988, 1998; Goodwin & Goodwin, 1985a, 1985b; West, Carmody, & Stallings, 1983). One purpose of these reviews has been the identification of trends in data-analytic practice. The documentation of such trends has a two-fold purpose: (a) it can form the basis for recommending improvements in research practice, and (b) it can be used as a guide for the types of inferential procedures that should be taught in methodological courses, so that students have adequate skills to interpret the published literature of a discipline and to carry out their own projects.

One consistent finding of methodological research reviews is that a substantial gap often exists between the inferential methods that are recommended in the statistical research literature and the techniques that are actually adopted by applied researchers (Goodwin & Goodwin, 1985b; Ridgeway, Dunston, & Qian, 1993). The practice of relying on traditional methods of analysis is, however, dangerous. The field of statistics is by no means static; improvements in statistical procedures occur on a regular basis. In particular, applied statisticians have devoted a great deal of effort to understanding the operating characteristics of statistical procedures when the distributional assumptions that underlie a particular procedure are not likely to be satisfied. It is common knowledge that under certain data-analytic conditions, statistical procedures will not produce valid results.
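To make this concrete, the following is a minimal simulation sketch (ours, not the article's; the group sizes, standard deviations, and replication count are illustrative assumptions) showing how pairing the smaller group with the larger variance distorts the Type I error rate of the ANOVA F test even when the population means are identical:

```python
# Minimal simulation sketch (not from the article): unequal variances combined with
# unequal group sizes distort the ANOVA F test's Type I error rate under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_sizes = (40, 10)        # assumed sizes: the smaller group gets the larger variance
group_sds = (1.0, 4.0)        # assumed heterogeneous population SDs, equal means (H0 true)
reps, alpha = 20_000, 0.05

rejections = 0
for _ in range(reps):
    samples = [rng.normal(0.0, sd, n) for n, sd in zip(group_sizes, group_sds)]
    _, p = stats.f_oneway(*samples)   # conventional ANOVA F test
    rejections += p < alpha

print(f"Empirical Type I error rate: {rejections / reps:.3f} (nominal {alpha})")
# With this negative pairing of group size and variance, the empirical rate runs
# well above .05, the kind of hidden distortion described in the text below.
```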
The applied researcher who routinely adopts a traditional procedure without giving thought to its associated assumptions may unwittingly be filling the literature with nonreplicable results. Every inferential statistical tool is founded on a set of core assumptions. As long as the assumptions are satisfied, the tool will function as intended. When the assumptions are violated, however, the tool may mislead.

It is well known that the general class of analysis of variance (ANOVA) tools frequently applied by educational researchers, and considered in this article, includes at least three key distributional assumptions. For all cases, the outcome measure Y_ik (or "score") associated with the i-th individual within the k-th group is normally and independently distributed, with a mean of μ_k and a variance of σ². Importantly, because σ² does not include a k subscript, the score variances within all groups are assumed equal (variance homogeneity). Only if these three assumptions are met can the traditional F tests of mean differences be validly interpreted, for without the assumptions (or barring strong evidence that adequate compensation for them has been made), it can be, and has been, shown that the resulting "significance" probabilities (p-values) are, at best, somewhat different from what they should be and, at worst, worthless. Concretely, what this means is that an assumptions-violated test of group effects might yield an F ratio with a corresponding significance probability of p = .04, which (based on an a priori Type I error probability of .05) would lead a researcher to conclude that there are statistically nonchance differences among the K groups. However, and unknown to the unsuspecting researcher, the "true" probability of the obtained results, given a no-difference hypothesis and violated assumptions, could perhaps be p = .37, contrarily suggesting that the observed differences are likely due to chance. And, of course, the converse is also true: A significance probability that leads a researcher to a no-difference conclusion might actually be a case of an inflated Type II error probability stemming from the violated distributional assumptions.

The "bottom line" here is that in situations where a standard parametric statistical test's assumptions are suspect, conducting the test anyway can be a highly dangerous practice. In this article, we not only remind the reader of the potential for this danger but, in addition, provide evidence that the vast majority of educational researchers are conducting their statistical analyses without taking into account the distributional assumptions of the procedures they are using. Thus, one purpose of the following content analyses (based on a sampling of published empirical studies) was to describe the practices of educational researchers with respect to inferential analyses in popular research paradigms. The literatures reviewed encompass designs that are commonly used by educational researchers, that is, univariate and multivariate independent (between-subjects) and correlated groups (within-subjects) designs that may contain covariates. In addition to providing information on the use of statistical procedures, the content analyses focused on topics that are of current concern to applied researchers, such as power analysis techniques and problems of assumption violations.
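As a brief aside on the first of these topics, the sketch below shows what an a priori power calculation for a one-way ANOVA looks like, built directly on the noncentral F distribution. The number of groups, group size, effect size, and alpha are assumed values chosen for illustration, not figures drawn from the reviewed studies:

```python
# Minimal sketch (assumed design and effect size): a priori power for a one-way ANOVA
# computed from the noncentral F distribution, where lambda = f^2 * N (Cohen's f).
from scipy.stats import f, ncf

k, n_per_group = 4, 20                 # assumed number of groups and group size
cohen_f, alpha = 0.25, 0.05            # assumed "medium" effect size and Type I error rate
N = k * n_per_group
df1, df2 = k - 1, N - k
noncentrality = cohen_f ** 2 * N       # noncentrality parameter under the alternative
f_crit = f.ppf(1 - alpha, df1, df2)    # critical value of the central F under H0
power = 1 - ncf.cdf(f_crit, df1, df2, noncentrality)
print(f"Power to detect f = {cohen_f} with n = {n_per_group} per group: {power:.2f}")
```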
Furthermore, consideration was given to the methodological sources that applied researchers use, by examining references to specific statistical citations. Our second purpose, based on the findings of our reviews, is to present recommendations for reporting research results and for obtaining valid methods of analysis.

Prominent educational and behavioral science research journals were selected for review.1 An enumeration of the journals reviewed can be found in Table 1. These journals were chosen because they publish empirical research, are highly regarded within the fields of education and psychology, and represent different education subdisciplines. To the extent possible, all of the articles published in the 1994/1995 issues of each journal were reviewed by the authors.

The Analysis of Between-Subjects Univariate Designs

Past research has shown that the ANOVA F test is the most popular data-analytic technique among educational researchers (Elmore & Woehlke, 1998; Goodwin & Goodwin, 1985a) and that it is used most frequently within the context of one-way and factorial between-subjects univariate designs. However, researchers should be aware that although the ANOVA F test is the conventional approach for conducting tests of mean equality in between-subjects designs, it is not necessarily a valid approach, due to its reliance on the assumptions of normality and variance homogeneity. Specifically, recent surveys indicate that the data collected by educational and psychological researchers rarely if ever come from populations that are characterized by the normal density function or by homogeneous variances (Micceri, 1989; Wilcox, Charlin & Thompson, 1986). Hence, as previously indicated, the validity of statistical procedures that assume this underlying structure to the data is seriously in question. Specifically, the effect of using ANOVA when the data are nonnormal and/or heterogeneous is a distortion in the rates of Type I and/or Type II errors (and hence in the power of the test), particularly when group sizes are unequal.

In this content analysis we examined the method(s) adopted for testing hypotheses of mean equality involving main, interaction, and/or simple between-subjects effects. Methods for testing omnibus (overall) hypotheses could include the ANOVA F test or an alternative to the F test. Alternative test procedures could include the nonparametric Kruskal-Wallis test (Kruskal & Wallis, 1952) or the Mann-Whitney U test (in the case of two groups), as well as various parametric procedures such as the Brown and Forsythe, James, and Welch tests (see Coombs, Algina & Oltman, 1996), which are all relatively insensitive to the presence of variance heterogeneity. Trend analysis may also be used in cases where the levels of the between-subjects factor(s) are quantitative, rather than qualitative, in nature. As well, planned (a priori) contrasts may be used to answer very specific research questions concerning one's data.

The use of multiple comparison procedures (MCPs) for testing hypotheses concerning pairs of between-subjects means was also examined. The specific strategy adopted to control either the familywise rate of error (FWE) or the per-comparison rate of error (PCE) was identified, as was the type of test statistic used. In between-subjects designs, the pairwise comparison test statistic may be computed in different ways, depending on the assumptions the researcher is willing to make about the data (see Maxwell & Delaney, 1990, pp. 144-150).
For example, in a one-way design, one test statistic (which we will call single-error) incorporates the error term from the omnibus test of the between-subjects effect. Accordingly, the variance homogeneity assumption must be satisfied for such an approach to provide valid tests of pairwise comparisons. The alternative (separate-error) uses an error term based on only the data associated with the particular levels of the between-subjects factor that are being compared. In the latter approach, which does not assume homogeneity across all factor levels, each pairwise comparison statistic has a separate error term. In unbalanced (unequal cell sizes) factorial designs, also known as nonorthogonal designs, the sums of squares (SS) for marginal (e.g., main) effects may be computed in different ways. That is, tests of weighted or unweighted means may be performed depending on the hypotheses of interest to the researcher (see Carlson & Timm, 1974).

Research Design Features and Methods of Analysis

Table 2 contains information pertaining to design characteristics of the 61 between-subjects articles which were examined in this content analysis. One-way designs (59.0%) were more popular than factorial designs (47.5%). However, it should be noted that there was some overlap with respect to this classification, as four articles reported the use of both types of designs. Overall, unbalanced designs were more common than balanced designs. This is particularly evident in the case of studies involving factorial designs, where almost three-quarters of those identified (72.4%) were comprised of cells containing unequal numbers of units of analysis. Of the 23 one-way studies in which an unbalanced design was used, the ratio of the largest to the smallest group size was greater than 3 in 43.5% of these. Of the 21 unbalanced factorial studies, the ratio of the largest to the smallest cell size was greater than 3 in 38.1% of these.

Table 2 also contains information pertaining to the methods of inferential analysis in the studies which incorporated a between-subjects univariate design. The ANOVA F test was overwhelmingly favored, and was used by researchers in more than 90% of the articles. A nonparametric analysis was performed by the authors of only four articles; in each of these, one-way designs were under investigation. Planned contrasts were reported in two articles, in both cases for assessing an effect in a one-way design. Trend analysis was used by the authors of one article, also in relation to the analysis of a one-way design. In only 3 of the 21 articles in which a nonorthogonal design was used did the authors report the method adopted to compute the SS for marginal effects. In two of these, unweighted means were adopted and in one weighted means were used. In two articles, both involving one-way designs, the authors did not conduct a test of the omnibus hypothesis, and instead proceeded directly to pairwise mean comparisons.

In total, 29 articles reported the use of a MCP (46.8%). Tukey's procedure was most popular (27.6%), followed by the Newman-Keuls method (20.7%) (see Kirk, 1995, for MCP references). In only three instances (10.3%) did the author(s) conduct unprotected multiple t-tests, which allow for control of the PCE rather than the FWE. Little difference existed in the popularity of MCPs for the analysis of one-way and factorial designs; in both cases Tukey's procedure was favored.
However, Duncan's procedure was only used for testing hypotheses involving pairs of means in one-way designs, and the Newman-Keuls procedure was more popular in factorial designs than in one-way designs. It has been shown that both the Fisher and Newman-Keuls procedures cannot control the FWE when more than three means are compared in a pairwise fashion (Keselman, Keselman, & Games, 1991; Levin, Serlin, & Seaman, 1994). Despite this, half of the studies in which the Newman-Keuls procedure was adopted contained more than three means, while Fisher's procedure was used in one such study. MCPs were used in factorial designs more often to test for differences in pairs of marginal means (n = 9) than to test pairs of simple means (n = 6). Finally, with respect to the test statistic used in the MCP analyses, in only one article was it possible to discern that a separate-error test statistic had been adopted. In this case, which involved a one-way design, multiple t tests were conducted, and the authors did not perform a preliminary omnibus analysis.

Assessment of Validity Assumptions

With respect to the assessment of validity assumptions, our first task was to examine possible departures from variance homogeneity. Thirteen of the 61 articles which incorporated between-subjects univariate designs did not report group or cell standard deviations for any of the dependent variables under investigation. For the remaining articles, we focused our attention on at most the first five variables that were subjected to analysis in order to limit the data set to a manageable size. For one-way designs, we collected standard deviation information for 86 dependent variables. The average value of the ratio of the largest to smallest standard deviation was 2.0 (SD = 2.6), with a median of 1.5. Several extreme ratio values were noted in the one-way designs, with a maximum ratio of 23.8. In the factorial studies, information was obtained for 85 dependent variables, with a mean ratio of 2.8 (SD = 4.2), a median of 1.7, and a maximum ratio of 29.4.

For one-way designs, a positive relationship between group sizes and standard deviations existed for 31.3% of the dependent variables, a negative relationship was identified for 22.1%, no discernible pattern was observed for 15.1%, and this classification was not applicable for 25.6% of the dependent variables because group sizes were equal. For five dependent variables it was not possible to categorize this relationship because group size information was not provided. For factorial designs, a negative relationship between cell sizes and standard deviations was revealed for 23.5% of the dependent variables, a positive relationship was evident for 14.1%, and no relationship was evident for 31.8% of the dependent variables. As well, this relationship was not applicable for 14.1% of the dependent variables because the design was balanced.

In 12 articles (19.7%), the author(s) indicated some concern for distributional assumption violations. Normality was a consideration in seven articles, although no specific tests for violations of this assumption were reported; rather, it appears that normality was assessed by descriptive measures only. Variance homogeneity was evaluated in five articles, and it was specifically stated that this assumption was tested in three of these articles. Only one article considered both assumptions simultaneously. The authors of these articles used a variety of methods to deal with assumption violations.
In total, five studies relied on transformations; typically, these were used where the dependent variables of interest were measured using a percentage scale. A nonparametric procedure was adopted in two articles, in one because the dependent variable under investigation was skewed, and in the other because variances were heterogeneous. One set of authors tested for heterogeneity using Levene's (1960) test and obtained a significant result, but chose to proceed with use of the ANOVA F test. In two articles where skewness was due to outliers, these values were Winsorized; that is, the extreme scores were replaced with less extreme values. In one case, the authors chose to redesign the study in order to avoid dealing with nonnormal data. Thus, although a 2 × 4 factorial design was originally employed, it was reduced to a 2 × 2 design because of nonnormality due to floor effects in four cells of the design. Finally, the authors of one study elected to convert a continuous dependent variable to a categorical variable, and then they conducted a frequency analysis rather than a means analysis due to the existence of skewness in the data.

Power/Effect Size Analysis

The issue of power and/or effect size calculations arose in only 10 articles (16.1%). Effect sizes were calculated in six of these, but the statistic used was not routinely reported, and main effects were more often of interest than interactions. The authors of two articles were concerned that the power to detect an interaction might be low, and thus performed post hoc analyses of power. The authors of one article reported that although the independent variable under investigation was quantitative in nature, it was converted to a categorical variable and the ANOVA F test was used instead of regression analysis. This was done because the authors felt that the former approach would result in greater statistical power than the latter; however, no empirical support for this premise was given.

Software Packages/Statistical Citations

The statistical software package used in data analysis was specified in only five articles. In three of these, SPSS (Norusis, 1993) was used, while SYSTAT (Wilkinson, 1988) and SAS (SAS Institute, 1990) were each used once. A variety of statistical sources were cited in the articles. However, no single source was used with great frequency and thus this component of the analysis was unrevealing.

Conclusions and Recommendations Concerning Between-Subjects Univariate Designs

This review reveals that behavioral science researchers use between-subjects univariate designs in a variety of contexts. Investigations involving a single between-subjects factor were favored slightly more than those in which the effects of multiple factors were jointly considered, although in both cases, designs with unequal group sizes were more popular than designs with equal group sizes. As anticipated, the ANOVA F test was the method of choice for examining group effects, despite its reliance on the stringent assumptions of normality and variance homogeneity. This is a disturbing trend, as Lix, Keselman, and Keselman (1996), in a quantitative review of the effects of assumption violations on the ANOVA F test in one-way designs, found very few instances in which this conventional method of analysis was appropriate.
Although the ANOVA F test may be relatively insensitive to violations of the normality assumption in terms of Type I error control, it is highly sensitive to differences in population variances. This sensitivity is accentuated when group sizes are unequal. Similar findings have been reported by Keselman, Carriere, and Lix (1995) and Milligan, Wong, and Thompson (1987) with respect to factorial designs, regardless of the method used to compute the sums of squares for marginal effects. Normality does, however, have important implications for the control of Type II errors (Wilcox, 1995).

The routine use of the F test in the face of assumption violations may stem from the fact that behavioral science researchers do not appear to give a great deal of thought to assumption violations, as less than 20% of the articles considered in this review made mention of this issue. When it was clear that assumptions were considered, normality was more likely to be of concern than variance homogeneity, and transformations were typically used as a means of normalizing the distribution of responses. Although the adoption of a nonparametric procedure may be useful when the normality assumption is untenable, it is not good practice when the assumption of variance homogeneity is suspect. The Lix et al. (1996) review showed that the Kruskal-Wallis test (Kruskal & Wallis, 1952) is highly sensitive to unequal variances.

It is equally important to consider the underlying distributional assumptions when pairwise comparisons of means or other contrasts are performed on the data. In only one paper was the choice of a test statistic specified (in that case a separate-error statistic was used), and thus it was difficult to determine what assumptions the majority of researchers were making about the data in testing hypotheses involving pairs of means. It is interesting to note that in only two studies did the authors not elect to perform omnibus tests of between-subjects effects. Rather, the more common practice was to perform one or more omnibus tests, which, if significant, were followed by simple effect tests and/or pairwise comparisons of means.

As anticipated, effect sizes were almost never reported along with p-values, despite encouragement to do so by the most recent edition of the American Psychological Association's (1994) Publication Manual. Moreover, indications of the magnitude of interaction effects were extremely rare. Finally, it should be noted that in all instances where effect sizes were given, a statistically significant result was obtained.

We feel there are a number of ways in which behavioral science researchers can improve their analyses of between-subjects univariate designs. We strongly encourage: (a) selecting robust methods for conducting omnibus tests and contrasts, (b) conducting focused tests of hypotheses, and (c) routinely reporting measures of effect.

With respect to the first point, many studies have demonstrated that the ANOVA F test is very frequently inappropriate for testing the presence of group mean differences in between-subjects designs (see e.g., Wilcox, 1987). Despite these repeated cautionary notes, behavioral science researchers have clearly not taken this message to heart. It is strongly recommended that test procedures that have been designed specifically for use in the presence of variance heterogeneity and/or nonnormality be adopted on a routine basis. A number of research reviews give clear information on selection of robust methods; a minimal sketch of one such procedure, the Welch test, appears below.
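The sketch implements the Welch heteroscedastic one-way ANOVA directly from its textbook formula and, for comparison, also runs the rank-based Kruskal-Wallis test mentioned earlier. The data are fabricated for illustration; this is one possible coding, not a prescription from the reviewed literature:

```python
# Minimal sketch (made-up data): Welch's heteroscedastic one-way ANOVA, coded from the
# standard formula, alongside the Kruskal-Wallis test available in scipy.
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's heteroscedastic one-way ANOVA; returns (F, df1, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                     # precision weights
    grand = np.sum(w * m) / np.sum(w)             # weighted grand mean
    a = np.sum(w * (m - grand) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    b = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    f_stat = a / b
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * tmp)
    return f_stat, df1, df2, stats.f.sf(f_stat, df1, df2)

# Illustrative data only; the middle group has a noticeably larger spread.
g1 = [12, 15, 11, 14, 13, 16]
g2 = [18, 22, 25, 19, 30, 21]
g3 = [9, 8, 14, 7, 12, 10]

print(welch_anova(g1, g2, g3))
print(stats.kruskal(g1, g2, g3))   # rank-based alternative; note it is still sensitive to unequal spread
```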
A good starting point is the paper by Lix et al. (1996), which documents the deficiencies of the F test (see also Harwell, Rubenstein, Hayes, & Olds, 1992) and provides clear guidelines on the conditions under which various robust procedures, including the Welch and James procedures, will exhibit optimal results. Also included in that paper is a discussion of computer programs that will perform these tests. Procedures that are robust to both variance heterogeneity and nonnormality are considered by Lix and Keselman (1998). A discussion of robust methods for use in factorial designs can be found in Keselman et al. (1995) and Keselman, Kowalchuk, and Lix (1998); see also Hsiung and Olejnik (1994b). The application of robust methods for conducting pairwise mean comparisons is considered by Keselman, Lix, and Kowalchuk (1997), Lix and Keselman (1995), and Olejnik and Hess (1997).

With respect to the second point, behavioral science researchers need to critically evaluate the usefulness of conducting preliminary omnibus tests of main and/or interaction effects. As Olejnik and Huberty (1993) note, "the most important limitation of the omnibus F-test is that it is so general that it typically does not address an interesting substantive question" (p. 7). It was typically the case that if a significant omnibus result was obtained, it was followed with additional tests to provide further information on the nature of the effect, such as pairwise mean comparisons. It is entirely possible to bypass the omnibus test and proceed directly to simple effect tests or pairwise comparisons, although a few MCPs do incorporate a preliminary test. A comprehensive discussion of the use of planned contrasts for data analysis can be found in most popular research methods/statistics textbooks, including Kirk (1995) and Maxwell and Delaney (1990), as well as in the work of Hsiung and Olejnik (1994a).

With respect to the third point, numerous sources have discussed the need for reporting a measure of effect size along with a p-value, in order to allow the reader to distinguish between those results that are "practically" significant and those that are only "statistically" significant. Although it is encouraging that a small number of the articles reviewed in the current content analysis reported a measure of effect or some form of power analysis, this type of information needs to be routinely reported. Educational researchers have at their disposal numerous sources on this topic, including Cohen (1992), Kirk (1996), and O'Brien and Muller (1993), as well as the recent compendium by Harlow, Mulaik, and Steiger (1997).

The Analysis of Between-Subjects Multivariate Designs

Univariate ANOVA actually involves more than one characteristic of the (experimental) units: there is one outcome variable, but there can be more than one grouping variable. It is the effect of the grouping variable(s) on the outcome variable that is of interest to the researcher who employs ANOVA techniques. Multivariate analysis of variance (MANOVA) can have one or more grouping variables, but would include multiple outcome variables (say, P in number). It is the effect of the grouping variable(s) on the collection of outcome variables that is of interest to the researcher who uses MANOVA techniques. Just as in the case of an ANOVA with one grouping variable, the interest in a MANOVA with one grouping variable is group comparison.
Groups are compared with respect to means on one or more linear composites of the outcome variables. That is, in a MANOVA context, it is the effect of the grouping variable(s) on the linear composite(s) of the outcome variables that is (or should be) of interest to the researcher. As we indicated in our introduction, all ANOVA-type statistics require that data conform to distributional assumptions in order to provide valid tests of statistical hypotheses. The validity assumptions for MANOVA include multivariate normality, homogeneity of the P × P covariance matrices, and independence of observations. Empirical findings indicate that when these assumptions are not satisfied, rates of Type I and II errors can be seriously distorted, particularly in nonorthogonal designs (see Christensen & Rencher, 1997; Coombs et al., 1996).

Research Design Features and Methods of Analysis

What was looked for in the articles reviewed for this content analysis was information related to the conduct of a MANOVA. A summary of some of the information reported for the 79 articles which were examined is given in Table 3. First, it is sometimes argued by methodologists that, when reasonable, aspects of randomization should be considered in designing a group-comparison study. In only 20 of the 79 studies was randomization considered; 6 involved random selection and 14 involved random assignment. With regard to sample size, one study included an apology for the relatively small sample size used. In another study, it was recognized that "large" Ns were used; therefore, a relatively low p-value was selected as a cut-off value in determining "significance." Two "conceptually distinct" sets of outcome variables were used in one study; this notion plus the ratios of minimum group size to the number of outcome variables were used by the authors to justify two MANOVAs rather than one MANOVA. [A recommendation that has been proposed is that the smallest group size should range from 6P to 10P (Huberty, 1996).] Statistical power was explicitly addressed in only five articles.

For about 76% (60/79) of the studies, tables of group-by-variable means (and standard deviations) were reported. A matrix of outcome variable intercorrelations was reported in only eight articles. In an overwhelming 84% (66/79) of the studies, researchers never used the results of the MANOVA(s) to explain effects of the grouping variable(s). Instead, they interpreted the results of multiple univariate analyses. In other words, the substantive conclusions were drawn from the multiple univariate results rather than from the MANOVA. Having found the use of such univariate methods, one may ask: Why were the MANOVAs conducted in the first place? Applied researchers should remember that MANOVA tests linear combinations of the outcome variables (determined by the variable intercorrelations) and, therefore, does not yield results that are in any way comparable with a collection of separate univariate tests. Although it was not indicated in any article, it is surmised that researchers followed the MANOVA-univariate data analysis strategy for protection from excessive Type I errors in the univariate statistical testing. This strategy may not be too surprising because it is suggested by some book authors (e.g., Stevens, 1996, p. 152; Tabachnick & Fidell, 1996, p. 376). There is very limited empirical support for this strategy.
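To illustrate the distinction, the sketch below runs a one-factor MANOVA on simulated data and then the separate univariate F tests that the reviewed articles typically interpreted. The variable names, the data, and the choice of statsmodels' MANOVA routine are our own assumptions, offered only to show that the two analyses answer different questions:

```python
# Minimal sketch (simulated data, hypothetical variable names): a one-factor MANOVA on two
# correlated outcomes, contrasted with the separate univariate ANOVAs on each outcome.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(2)
groups = np.repeat(["g1", "g2", "g3"], 20)
y1 = rng.normal(0, 1, 60) + (groups == "g3") * 0.8   # third group shifted on y1
y2 = 0.6 * y1 + rng.normal(0, 1, 60)                 # y2 correlated with y1
df = pd.DataFrame({"group": groups, "y1": y1, "y2": y2})

# Multivariate test of the grouping effect on the set of outcomes (Wilks' lambda, etc.).
print(MANOVA.from_formula("y1 + y2 ~ group", data=df).mv_test())

# Separate univariate F tests: these ignore the outcome intercorrelations entirely.
for outcome in ("y1", "y2"):
    samples = [df.loc[df.group == g, outcome] for g in ("g1", "g2", "g3")]
    print(outcome, stats.f_oneway(*samples))
```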
A counter position may be stated simply as: Do not conduct a MANOVA unless it is the multivariate effects that are of substantive interest. If the univariate effects are those of interest, then it is suggested that the researcher go directly to the univariate analyses and bypass MANOVA. When doing the multiple univariate analyses, if control over the overall Type I error is of concern (as it often should be), then a Bonferroni (Huberty, 1994, p. 17) adjustment or a modified Bonferroni adjustment may be made. (For a more extensive discussion of the MANOVA versus multiple ANOVAs issue, see Huberty and Morris, 1989.) Focusing on results of multiple univariate analyses preceded by a MANOVA is no more logical than conducting an omnibus ANOVA but focusing on results of group contrast analyses (Olejnik & Huberty, 1993).

If multivariate effects are of interest, then some descriptive discriminant analysis (DDA) techniques would be appropriate (see Huberty, 1994, ch. XV). DDA techniques were used in only four of the 79 studies reviewed. In one of these studies, four linear discriminant functions were substantively interpreted in discussing group separation. In this same study, techniques of predictive discriminant analysis (PDA) were used "as a descriptive tool to highlight and to further clarify the results (of the DDA)." A second study also mentioned the use of PDA techniques; but by "mixing" PDA and DDA techniques to arrive at classification rules, the analysis lost its meaningfulness.

Assessment of Validity Assumptions

It was disappointing, but perhaps not too surprising, that in only a small percentage of the 79 studies were data conditions considered. As indicated, the data conditions of some concern in a MANOVA context pertain to multivariate normality and covariance matrix equality. No studies even mentioned the latter condition. In one study the authors tested for "homogeneity of variances" (which applies only to the univariate context). In six studies, data transformations were used; two studies used the arcsine transformation of proportions and one study used a square root transformation of percents. In one of the repeated measures (RM) MANOVA studies, the condition of sphericity was considered. Very extensive consideration of data conditions was made in one article: normality, covariance matrix homogeneity, sphericity, outliers, covariate regression slopes, and multicollinearity.

Power/Effect Size Analysis

Effect size index values were reported in only eight of the 79 articles. Seven studies used univariate indexes and one study reported multivariate eta-squared values. The actual statistical test criterion (e.g., Wilks) was reported in only a handful of studies; rather, an F value was reported (usually without any indication of degrees of freedom [df]). All four of the popular test criteria (Bartlett-Pillai, Hotelling-Lawley, Roy, Wilks) may be transformed to F values, so the reporting of an F value does not tell the reader which criterion was used (Huberty, 1994, p. 189). If no criterion value is reported, the reader has some difficulty in arriving at an effect size index value.

Software Packages/Citations

Only 12 of the 79 studies stated the software package used, and only 28 of the articles included references to data analysis books and/or articles. This is somewhat surprising considering the data analysis methods used.
It may be worth mentioning that even though all 79 articles reviewed were published in 1994 and 1995, some of the data analysis references were not to later editions of books, but rather to editions from the 1980s or before.

Conclusions and Recommendations Concerning Between-Subjects Multivariate Designs

In this section we suggest information that can (should?) be reported in a study that involves a multiple-group, multiple-variable design in which a MANOVA would be considered.

Pre-Analysis

Outcome variables. Ideally the collection of outcome variables should constitute a variable system in the sense that the variables conceptually and substantively "hang together." This initial choice of variables may be based on substantive theory, previous research, expert advice, and professional judgment. The rationale used for including multiple related variables measuring one or more underlying construct(s) should be made clear. Explicit listing (e.g., in a table) of all outcome variables and how each is measured would enhance manuscript readability. Any use of data transformations should be reported. The reporting of the reliability of the measures for each outcome variable would be a real plus.

Outlying observation vectors. As is well known, a few outliers can "foul up" an analysis in surprising ways. An indication that a search for outliers was conducted, and the steps taken, if any, should be stated. For a discussion of outlier detection in psychology, see Orr, Sackett, and Dubois (1991).

Completeness of data matrix. The manner of handling missing data should be discussed (see, for example, Roth, 1994). A second search for outliers may be conducted after the data matrix is completed.

Data conditions. A brief discussion of the extent to which the available data satisfy the conditions of group multivariate normality and equal group covariance matrices should be given. If there is concern about the equality of covariance matrices, then various robust alternatives are available (see, e.g., Christensen & Rencher, 1997; Coombs et al., 1996; Huberty, 1994, pp. 199, 203). In the two-group problem, where the hypothesis to be tested is H0: μ1 = μ2 (where μk indicates a vector of two or more variable means), researchers can adopt the procedures due to Kim (1992) or Johansen (1980). For the many-group problem, where the hypothesis to be tested is H0: μ1 = μ2 = … = μK, researchers can choose from among the procedures due to Coombs and Algina (in press), James (1954), or Johansen (1980) (see Coombs & Algina, 1996; Coombs et al.). Current findings suggest that for many of the parametric conditions likely to be encountered by behavioral science researchers, these procedures should adequately control Type I error; that is, they should provide robust tests of their respective null hypotheses. Assessment of covariance matrix equality and of P-variate normality, including the use of statistical package programs, is discussed by Huberty and Petoskey (in press).

Analysis

Descriptives. There are three basic types of descriptive information for a K-group, P-variable MANOVA situation that should be reported: means and standard deviations on each outcome variable for each of the K groups, and the P × P error correlation matrix. One might also report a K × K matrix of Mahalanobis squared distance values. As a sidenote, another type of information that may be considered consists of the P univariate descriptive F values.
This descriptive information may indicate to the reader some of the "strong" outcome variables and, if an F value is less than 1.00, then that variable would be contributing more "noise" than "signal." [Caution: Univariate F tests should not be used to assess relative variable contribution in a multivariate study.]

Statistical tests. For MANOVA main, interaction, or contrast effects, the following test information is suggested: criterion (e.g., Wilks) value, test statistic value (with df values), p value, and effect size value. Information for contrast effect tests would be the same as for the omnibus effect tests.

Labeling of linear discriminant functions (LDFs). This information would be relevant if an argument is implicitly or explicitly made for approximate equality of group (or cell) covariance matrices. The number of LDFs to consider may be determined in one or more of three ways (statistical tests, proportions of variance, and LDF plots; see Huberty, 1994, pp. 211-216). The retained LDFs may be interpreted/named/labeled by examining the LDF-variable correlations (sometimes called structure r's).

Optional information. Some optional information that may be reported includes LDF plots, outcome variable rank ordering, and outcome variable deletion. These details are reviewed by Huberty (1994, chs. XV, XVI).

The Analysis of Repeated Measures Designs

Researchers frequently obtain successive measurements from their participants, and consequently RM designs often provide the blueprint for experimental manipulations and data collection. RM designs are popular for a number of reasons. First, they are economical in comparison to designs that require an independent group of participants for each treatment combination of independent variables. That is, fewer participants are required in RM designs than in completely randomized designs when the effects of certain variables can be measured across the same set of participants. This can be particularly advantageous when participants are expensive to obtain or measure, or are scarce in number. A second major advantage of treating a variable as a within-subjects variable as opposed to a between-subjects variable relates to the power to detect treatment effects. By manipulating a variable as a within-subjects variable, that is, by exposing participants to all levels of a variable, variability due to individual differences across the levels of the variable is eliminated from the estimate of error variance, thus making it easier to detect treatment effects when they are present. This gain in power can be substantial. Finally, in addition to economy and sensitivity, RM designs are clearly the design of choice when the phenomenon under investigation is time related, such as when investigating developmental changes, learning and forgetting constructs, or the effects of repeatedly administering a drug or type of therapy.

In this content analysis, three categories were used to define the type of RM research design: simple, single-group factorial, and mixed. In a simple design, a single group of participants is evaluated at each level of one RM factor. In a single-group factorial design, on the other hand, a single group of participants is evaluated at each combination of levels of two or more RM factors.
In a mixed design, participants are classified into groups or randomly assigned to groups on the basis of one or more factors and are evaluated at each level of a single RM factor, or at each combination of levels of two or more RM factors. The use of covariates in each of these designs was also noted.

In any of these designs, the conventional ANOVA F test is appropriate for testing RM effects only if the assumption of (multisample) sphericity is met. When sphericity is an untenable assumption, either a df-adjusted univariate approach or a multivariate approach can be adopted. In the former approach, the critical value used in hypothesis testing is based on numerator and denominator df which are modified to reflect the magnitude of the departure from sphericity in the sample data. Two different df-adjusted tests are typically recommended for use by applied researchers, and are often referred to as the Huynh-Feldt and Greenhouse-Geisser tests (see Maxwell & Delaney, 1990). MANOVA may also be used to test RM effects; this approach does not depend on the sphericity assumption. In designs containing quantitative covariates, the data may be analysed using conventional analysis of covariance (ANCOVA), df-adjusted ANCOVA, or multivariate analysis of covariance (MANCOVA) techniques. For RM designs which are multivariate in nature, and which are analysed as such, multivariate MANOVA or MANCOVA procedures may be used. Multivariate RM data may be analysed from either a multivariate mixed model or a doubly multivariate model perspective (Boik, 1988). The former approach assumes that the multivariate (multisample) sphericity assumption is satisfied, while the latter approach does not. Other, less commonly used procedures for testing RM effects include nonparametric procedures, trend analysis, and regression analysis, as well as tests for categorical data such as z tests or chi-square tests of association.

As in between-subjects designs, MCP test statistics that are used in RM designs may be computed in different ways, depending on the assumptions the researcher is willing to make about the data (Keselman & Keselman, 1993). For example, in the simple RM design, one test statistic that may be used incorporates the error term for the omnibus test of the RM effect. As before, we will refer to this as a single-error statistic because the error term is based on the data from all levels of the RM factor. Accordingly, the sphericity assumption must be satisfied for such an approach to provide valid tests of pairwise comparisons (Keselman, 1982). The alternative, a separate-error statistic, uses an error term based on only that data associated with the particular levels of the RM factor that are being compared (Maxwell, 1980). Thus, in the latter approach, which does not depend on the sphericity assumption, each pairwise comparison statistic has a separate error term. The same concept of single- and separate-error pairwise comparison statistics applies to factorial and mixed RM designs in which multiple within- and/or between-subjects factors exist, but the separate-error statistic may be computed in different ways depending on the assumptions the researcher is willing to make about the data.

Research Design Features and Methods of Analysis

Information pertaining to the classification of the research articles by the type of design is contained in Table 4. Mixed designs were overwhelmingly favored, and were represented in 190 articles (84.1%).
Among this number, unbalanced designs (50.5%) were more common than balanced designs (40.5%), although 6 articles reported that both balanced and unbalanced mixed designs were incorporated in a single study (3.2%). Simple designs and single-group factorial designs were rarely used, and were only found in 11.5% and 10% of the articles, respectively. Total sample size varied considerably across the investigated articles and ranged from six to more than 1000 units of analysis. For mixed designs, 16 articles reported total sample sizes which did not exceed 20 units of analysis, and six reported values greater than 400. However, more than half of the mixed design articles (55.3%) reported total sample sizes of 60 or fewer units of analysis. An investigation of group/cell sizes in the articles which contained an unbalanced mixed design revealed that the ratio of the largest to smallest value was not greater than 1.5 in 56.3% of these. Among those articles in which a simple design was used, nine (34.6%) reported a total sample size of 30 units or less, while for the single-group factorial design articles, 14 (63.6%) did so.

Information collected on the types of analyses is also contained in Table 4. As anticipated, inferential techniques were favored in the analysis of all three types of designs, and univariate analyses were more popular than multivariate analyses. In fact, none of the articles relied solely on multivariate techniques for the analysis of RM data; wherever multivariate analyses were performed, they were accompanied by univariate analyses.

Table 5 contains information pertaining to methods of inferential analysis for RM effects. In this table, all of the articles in which the RM factor(s) had only two levels were excluded because in such cases, sphericity is trivially satisfied. If a design employed multiple RM factors, at least one had to have more than two levels in order to be considered in the subsequent analysis. Thus, for mixed, simple, and single-group factorial RM designs, the numbers of articles that were subjected to analysis were 103, 13, and 12, respectively.

As Table 5 reveals, for mixed designs, the conventional ANOVA F test was overwhelmingly favored (68.9%). A small number of articles (3.9%) reported the use of a mixed design involving covariates for which the authors adopted the conventional ANCOVA F test. In only two mixed design articles was MANOVA used to test RM hypotheses, and MANCOVA was used once. In one of the articles in which MANOVA was used, sphericity was evaluated using Mauchly's (1940) test; where a significant result was obtained, a multivariate analysis was adopted instead of the conventional ANOVA approach. In another article where sphericity was tested and a significant result was obtained, both the conservative F test and the df-adjusted F test were applied to the data, and it was noted whether one or both of the tests were significant. Both multivariate MANOVA (5.8%) and multivariate MANCOVA (1.0%) techniques were used, albeit in a limited manner; the multivariate mixed model perspective was adopted in all of these articles. In articles where multivariate MANOVA was used, the multivariate analyses were always followed by separate univariate analyses using the conventional ANOVA F test. In the one article where multivariate MANCOVA was used, no univariate tests involving RM effects were conducted; the authors were only interested in univariate tests of between-subjects effects.
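For readers unfamiliar with the df-adjusted approach mentioned above, the following is a minimal sketch (simulated data, illustrative numbers) of a one-group RM ANOVA in which the Greenhouse-Geisser epsilon is computed from the sample covariance matrix and used to shrink the numerator and denominator df:

```python
# Minimal sketch (simulated data): one-group RM ANOVA with a Greenhouse-Geisser adjustment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 12, 4
subject = rng.normal(0, 2, (n, 1))                         # subject effect induces correlation
noise = rng.normal(0, 1, (n, k)) * np.array([1.0, 1.0, 2.0, 3.0])   # unequal variances -> nonsphericity
data = subject + noise + np.array([0.0, 0.2, 0.4, 0.6])    # small occasion effect

# Conventional RM ANOVA F for the within-subjects effect.
grand = data.mean()
col_m, row_m = data.mean(axis=0), data.mean(axis=1, keepdims=True)
ss_treat = n * np.sum((col_m - grand) ** 2)
ss_error = np.sum((data - row_m - col_m + grand) ** 2)
df1, df2 = k - 1, (n - 1) * (k - 1)
F = (ss_treat / df1) / (ss_error / df2)

# Greenhouse-Geisser epsilon: trace formula applied to the contrast-space covariance matrix.
S = np.cov(data, rowvar=False)                              # k x k sample covariance matrix
basis = np.hstack([np.ones((k, 1)) / np.sqrt(k), np.eye(k)[:, : k - 1]])
Q, _ = np.linalg.qr(basis)
C = Q[:, 1:].T                                              # (k-1) x k orthonormal contrasts
M = C @ S @ C.T
eps = np.trace(M) ** 2 / ((k - 1) * np.trace(M @ M))

p_conv = stats.f.sf(F, df1, df2)
p_gg = stats.f.sf(F, eps * df1, eps * df2)                  # df shrunk toward the worst case
print(f"F = {F:.2f}, epsilon = {eps:.2f}, conventional p = {p_conv:.3f}, GG-adjusted p = {p_gg:.3f}")
```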
Six articles reported an incorrect analysis of RM data from mixed designs. In four of these articles, the error df did not correspond to those associated with the reported method of analysis (i.e., ANOVA or MANOVA). In the six articles contained in the not clearly stated category, it was not possible to determine what method of analysis had been used because df were not reported, although it was typically the case that the author(s) stated that an ANOVA approach had been used. Five articles incorporated a mixed design but did not involve an analysis of RM effects; these were classified in the category of no RM analysis.

MCPs of RM means were conducted in almost half of the mixed design articles (see Table 5). It is important to note that given our focus on methods of RM analysis, we did not examine procedures which were used to probe between-subjects effects. The most popular method for RM comparisons was Tukey's procedure, followed by the Newman-Keuls method. Of those mixed design articles in which pairwise comparisons were performed, marginal means were compared in 25 articles, while simple means were compared in 32 articles. In two articles, the interaction effect was probed with tetrad contrasts using multiple t tests. In a two-way design, a tetrad contrast essentially involves testing for the presence of an interaction between rows and columns in a 2 × 2 submatrix of the data matrix, and represents a test for a difference in two pairwise differences (a minimal sketch of such a contrast appears at the end of this subsection). In 43 of the articles in which mean comparisons were performed in mixed designs, it was not clear whether a single- or separate-error test statistic was employed. In seven articles, however, a separate-error test statistic was employed.

Table 5 also reports analysis methods for the simple RM designs. Here, use of the conventional ANOVA F test was reported in slightly more than one third of the articles. In six of the 13 simple RM articles, a MCP was used. The Bonferroni and Newman-Keuls procedures were most popular. In only one article was there an indication that a separate-error test statistic was used in conducting the pairwise comparisons.

Finally, Table 5 reveals that in three-quarters of the single-group factorial studies, the conventional ANOVA approach was used. One of these articles also relied on a df-adjusted ANOVA F test, in this case the Huynh-Feldt correction, when Mauchly's (1940) sphericity test proved to be significant. Planned contrasts were used in two articles to test specific RM hypotheses in factorial RM designs; in both instances these contrasts followed an omnibus analysis. It is interesting to note that in one of these articles, which involved a 4 × 3 single-group factorial design, the test of the interaction effect was followed by a series of 2 × 3 planned interaction subanalyses to provide a more specific determination of the source of the interaction. Pairwise comparisons of means were conducted in one third of those articles in which a single-group factorial RM design was used; information pertaining to the methods adopted is contained in Table 5. It is clear that no one procedure was a clear favorite, as a different method was used in each of the articles. Pairwise comparisons of marginal RM means were reported in three articles, and of simple effect RM means in two. In none of the articles was it possible to discern whether a single- or separate-error test statistic was used.
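The promised sketch of a tetrad contrast follows. The data are made up, the design is a two-group by two-occasion submatrix of a mixed design, and the separate-error (Welch) t test on within-subject change scores is one reasonable way to test the contrast, not necessarily the way the reviewed authors did so:

```python
# Minimal sketch (made-up data): a tetrad contrast in a mixed (group x time) design,
# i.e., the group difference in within-subject change, tested with a two-sample t test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Two groups measured at two occasions; rows are participants (illustrative data only).
g1 = rng.normal([10, 12], 2, size=(15, 2))      # group 1: average gain of about 2
g2 = rng.normal([10, 15], 2, size=(12, 2))      # group 2: average gain of about 5

d1 = g1[:, 1] - g1[:, 0]                        # within-subject change scores, group 1
d2 = g2[:, 1] - g2[:, 0]                        # within-subject change scores, group 2
psi = d1.mean() - d2.mean()                     # tetrad contrast: difference of two pairwise differences
t, p = stats.ttest_ind(d1, d2, equal_var=False) # separate-error (Welch) t test
print(f"tetrad contrast = {psi:.2f}, t = {t:.2f}, p = {p:.4f}")
```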
Assessment of Validity Assumptions

References to problems of distributional assumption violations were evaluated for the entire database, that is, for all 226 articles which incorporated RM designs. In total, in 35 of these articles (15.5%) the author(s) made reference to some aspect of assumption violations in performing tests of statistical significance. The most commonly mentioned issue was normality (n = 26), although none of these articles made reference to a specific test for normality. Rather, it appears that violations of this assumption were assessed via descriptive techniques. The most common method of dealing with nonnormal data was to transform the scores (n = 10), typically with an arcsine method, although a small number of articles (n = 4) reported that outliers were removed from the distribution of scores prior to analysis. Eleven other articles reported that a transformation had been applied to the distribution of scores, but gave no rationale for applying the transformation (i.e., these articles did not indicate that the normality assumption appeared untenable). Various other problems with data were mentioned. For example, in one article, Levene's (1960) test was applied to the data due to a concern for variance heterogeneity, but the authors did not evaluate the more complex assumption of (multisample) sphericity.

Power Analysis/Effect Size

Issues of statistical power/effect size were considered in 20 of the 226 articles (8.8%) in the database. In 16 of these articles, effect sizes were calculated, with the most common measure being Cohen's (1988) d statistic. In three articles, the authors mentioned that statistically significant findings may not have been revealed because of potentially low power, but no assessments of power were actually performed.

Statistical Software Packages/Citations

Only ten of the 226 articles in the RM database gave specific information concerning the use of a statistical software package. The SPSS program was favored, and was used in seven of the research reports. A wide variety of statistical references were found in the 226 RM articles. The two most popular sources were Winer (1971) and Cohen and Cohen (1983), which were each cited five times. The former was typically used as a reference for data transformations, while the latter was a reference for various statistical analysis issues in regression and ANOVA. Sources which were used specifically for justification in the choice of a RM analysis technique included McCall and Appelbaum (1973) and Games (1980).

Conclusions and Recommendations Concerning Repeated Measures Designs

Educational researchers make use of RM designs in a variety of contexts, but particularly in the study of developmental changes over time. In these instances, researchers should anticipate the existence of heterogeneous correlations among the repeated measurements, since participant responses that are adjacent in time will typically be more strongly correlated than those which are more distant. The existence of such serial correlation patterns will result in the data violating the sphericity assumption. It is impossible to evaluate the extent to which sphericity may be violated in behavioral science research, as none of the authors of papers included in this review gave details of this aspect of their data.
We recommend, however, that the conventional ANOVA approach for tests of within-subjects effects be avoided because of the problems associated with control of Type I errors under even a minimal degree of nonsphericity (Maxwell & Delaney, 1990, p. 474). Furthermore, while it is difficult to evaluate the extent to which behavioral science data depart from the more complex assumption of multisample sphericity in mixed designs, we also recommend that the conventional ANOVA approach not be adopted in such instances. In particular, tests of within-subjects interaction effects are highly susceptible to increased rates of Type I error when the design is unbalanced and multisample sphericity is not satisfied (see e.g., Keselman, Carriere, & Lix, 1993; Keselman & Keselman, 1993).

Despite the likelihood of the sphericity assumption not holding, this rarely appears to be a concern for educational researchers. Rather, the results of this content analysis suggest that in general, researchers do not give much thought to assumption violations when performing tests of statistical significance, as less than 16% of the papers made reference to this issue. When it was clear that assumptions were considered, normality was more likely to be of concern than sphericity, and transformations were a common way of normalizing the distribution of responses.

It is important to consider distributional assumptions not only when conducting omnibus tests of effects, but also when pairwise comparisons of means or other contrasts are performed on the data. A test statistic that employs an error term which is based on all of the data, in other words a single-error term, is based on the assumption of (multisample) sphericity. Rarely in this content analysis was the choice of a test statistic specified, and thus it was difficult to determine what assumptions the researchers were making about the data.

Furthermore, the general practice among the researchers whose articles were evaluated in this content analysis is to probe interaction effects by conducting tests of simple main effects and pairwise comparisons of simple main effect means. This strategy is inappropriate for evaluating the nature of an interaction effect (Boik, 1993; Lix & Keselman, 1996; Marascuilo & Levin, 1970) because simple effects are confounded by main effects. Thus, if the hypothesis associated with a simple effect test is rejected, the researcher cannot conclude whether the result is due to the presence of an interaction or a consequence of a marginal effect. The correct approach of testing specific contrasts regarding the interaction was rarely seen.

We recommend a number of ways by which behavioral science researchers can improve their analyses of RM data. First, we strongly encourage behavioral science researchers to consider the adoption of analysis methods that are robust to RM assumption violations. Preliminary tests of (multisample) sphericity do not provide a sound basis for a data-analytic decision and should therefore be avoided. Sphericity tests are sensitive to departures from multivariate normality and thus, rejection of the null hypothesis does not necessarily imply that the data are nonspherical (Keselman & Keselman, 1993; Keselman, Rogan, Mendoza, & Breen, 1980; Mendoza, 1980).
It is apparent that df-adjusted univariate procedures and multivariate procedures are severely underutilized in behavioral science research. We strongly encourage the adoption of these two approaches for analysing RM effects in simple and single-group factorial designs. A number of references are available that can help to demystify these procedures and aid in a decision between them, including Davidson (1972), Keselman and Keselman (1993), Maxwell and Delaney (1990), O'Brien and Kaiser (1985), and Romaniuk, Levin, and Hubert (1977). For these designs we also recommend one of the newest approaches to the analysis of repeated measurements, Boik's (1996) empirical Bayes (EB) approach. The EB approach is a blend of the df-adjusted univariate and the conventional multivariate approaches. The major statistical software packages (e.g., the general linear model and/or multivariate programs from SAS and SPSS) can be used to obtain numerical results for each of these approaches.

Furthermore, for mixed designs, although the adoption of either a df-adjusted univariate or a multivariate procedure represents a good first step in terms of obtaining more valid tests of RM hypotheses, a new class of procedures that are not dependent on the multisample sphericity assumption is available, and their use is strongly encouraged. Keselman et al. (1993) have shown that an approximate df multivariate solution can provide effective control of the Type I error rate in unbalanced mixed designs, provided that total sample size is sufficiently large. A program written in the SAS/IML language is given by Lix and Keselman (1995) for implementing this solution, along with examples and SAS/IML code demonstrating its use. Other approaches to this problem are discussed by Algina (1994), Algina and Oshima (1994), Keselman and Algina (1997), and Keselman, Algina, Kowalchuk and Wolfinger (1997).

A variety of multiple comparison procedures are available for data that do not satisfy the multisample sphericity assumption. An introductory paper on this topic is Keselman, Keselman, and Shaffer (1991). Current research in this area is discussed by Keselman (1994). As well, Lix and Keselman (1996) provide details of procedures that are appropriate for probing interactions in RM designs. In addition, their program can be used to obtain numerical results. A general discussion of this topic is also provided by Boik (1993).

Current research efforts are being directed towards the development of procedures that control the incidence of Type I errors and provide adequate statistical power when both the normality and sphericity assumptions are violated. Wilcox (1993) considers this problem. As well, it should be noted that new methods for the analysis of RM effects that allow the applied researcher to model and specify the correlational structure of the data are now available in the popular statistical packages (i.e., SAS's PROC MIXED; see Keselman, Algina, Kowalchuk, & Wolfinger, in press).
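As a rough illustration of the df-adjusted univariate approach, the following Python sketch computes a one-way repeated measures F statistic and a Greenhouse-Geisser type epsilon from the double-centered sample covariance matrix, then compares the conventional and epsilon-adjusted p values. The data are hypothetical and the code is only a sketch of the general idea, not a substitute for the procedures and programs cited above.

```python
import numpy as np
from scipy import stats

# Hypothetical data: rows are participants, columns are k repeated measurements.
rng = np.random.default_rng(3)
y = rng.normal(10, 2, size=(20, 4)) + np.arange(4)   # n = 20, k = 4, mild trend
n, k = y.shape

# One-group repeated measures ANOVA F statistic (subjects x occasions).
grand = y.mean()
ss_time = n * ((y.mean(axis=0) - grand) ** 2).sum()
ss_subj = k * ((y.mean(axis=1) - grand) ** 2).sum()
ss_err = ((y - grand) ** 2).sum() - ss_time - ss_subj
F = (ss_time / (k - 1)) / (ss_err / ((n - 1) * (k - 1)))

# Greenhouse-Geisser epsilon from the double-centered sample covariance matrix.
S = np.cov(y, rowvar=False)
C = np.eye(k) - np.ones((k, k)) / k
Sc = C @ S @ C
eps = np.trace(Sc) ** 2 / ((k - 1) * np.trace(Sc @ Sc))

# Conventional versus epsilon-adjusted p values.
p_conv = stats.f.sf(F, k - 1, (n - 1) * (k - 1))
p_gg = stats.f.sf(F, (k - 1) * eps, (n - 1) * (k - 1) * eps)
print(f"F = {F:.2f}, epsilon = {eps:.3f}, p(conventional) = {p_conv:.4f}, p(GG) = {p_gg:.4f}")
```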
However, at present, the limited information on this method suggests that it may be problematic when the wrong covariance structure is selected by the researcher (Keselman et al., in press, 1997).

We recommend that behavioral science researchers give serious thought to the value of multivariate analyses, rather than considering individual dependent variables in isolation. Methods for the analysis of RM data in a multivariate context are discussed by Lix and Keselman (1995) and Keselman and Lix (1997).

The Analysis of Covariance Designs

ANCOVA has two purposes. First, in experimental studies involving the random assignment of units to conditions, the covariate, when related to the response variable, reduces the error variance, resulting in increased statistical power and greater precision in the estimation of group effects. Second, in nonexperimental studies where random assignment is not used, the covariate, when related to the grouping variable, attempts to control for the confounding effect of the covariate.

A great deal has been written regarding the data assumptions made when using the ANCOVA model, including: independence, homoscedasticity, homogeneity of regression slopes, linearity, and conditional normality. Violating the first three assumptions can seriously affect the Type I error rate (Glass, Peckham & Sanders, 1972), particularly when the design is nonorthogonal (e.g., Hamilton, 1977; Levy, 1980).

Research Design Features and Methods of Analysis

For each journal we examined each article and selected those that reported the use of at least one application of univariate ANCOVA. Regression analyses that referred to some variables as covariates were excluded, as were studies that only reported on a multivariate ANCOVA. Most of the articles reviewed reported the results of several applications of ANCOVA as well as other analytic methods. In total we examined 651 articles and found 45 applications of ANCOVA, for a seven percent hit rate. A summary of our findings is provided in Table 6.

All but one of the studies used the individual as the unit of analysis. One study provided training to groups of children and appropriately used the group mean as the unit of analysis. One study analyzed both the individual and subgroups, and two studies were applications of hierarchical linear models (HLM) and considered both individuals and classrooms as the units of analysis.

In the applications of ANCOVA that we reviewed, two thirds of the studies (30) involved nonrandomization of the experimental units. This result supports what many believe, that ANCOVA is underutilized in experimental research (Maxwell, O'Callaghan & Delaney, 1993). In one study the researchers analyzed the data with and without the covariate; when the conclusions were the same, the researchers decided not to report on the details of the ANCOVA. None of the nonexperimental studies recognized the problem of measurement errors, nor the fact that all of the confounding variables may not have been controlled. Although explicit causal statements were not made, little effort was made to caution readers not to overinterpret the results.

Many statistics textbooks that present ANCOVA limit their discussions to the one-factor design with a single covariate (e.g., Keppel, 1991; Maxwell & Delaney, 1990). Only the most advanced texts address multiple covariates, factorial and RM designs (Kirk, 1995; Winer, Brown, & Michels, 1991). Even the advanced texts do not discuss in great detail how these analyses might be carried out and interpreted.
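A minimal worked example of the one-factor, single-covariate case may help fill that gap. The sketch below, written in Python with statsmodels (our choice of illustration, not a package used by the reviewed studies), fits a posttest-on-pretest ANCOVA for three hypothetical groups; the variable names pretest, posttest, and group are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical one-factor, single-covariate data: pretest as covariate,
# posttest as response, three instructional groups of equal size.
rng = np.random.default_rng(4)
n = 90
data = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], n // 3),
    "pretest": rng.normal(50, 10, n),
})
effect = data["group"].map({"A": 0, "B": 3, "C": 5})
data["posttest"] = 10 + 0.8 * data["pretest"] + effect + rng.normal(0, 5, n)

# ANCOVA: the covariate absorbs pretest variation, so the group effect is
# tested against a smaller error term and estimated as adjusted means.
model = smf.ols("posttest ~ pretest + C(group)", data=data).fit()
print(anova_lm(model, typ=2))   # Type II sums of squares for pretest and group
print(model.params)             # pretest slope and adjusted group contrasts
```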
Among the studies using ANCOVA, over one-third (17) used a factorial design and 11 studies used a mixed model design; thus almost two-thirds (28) of the studies were multifactor designs. In 19 of the studies multiple covariates were used, and in two studies the covariate varied by level of the within-subjects factor.

Twenty-one of the studies had two or more between-group factors (17 factorial and four mixed model designs), and 18 of these studies had unequal and disproportional group sizes. The average group size in the nonorthogonal multi-group analyses equalled 34.5, while for the balanced multi-group studies (3) the average group size was 35.3. For eight of these studies only the total sample size and the number of groups were reported. Twenty-one of the studies involved a single between-group factor (14 one-way and 7 mixed model designs), over seventy percent (15) of which had unequal and disproportional group sizes that averaged 37.1 units. The balanced single-factor designs had an average of 19.4 observations per group. The inequality of group sizes was not extreme in most cases: two-thirds of the studies had a ratio of largest to smallest group size of less than two. In the single-factor designs the largest ratio of largest to smallest group sizes equalled 8.06, while in the multi-factor designs the largest ratio equalled 5.15.

In mixed model designs, only the effects involving the between-subjects factor(s) are adjusted by the covariate when the univariate approach to RM is used for hypothesis testing. No adjustment to the within-subjects effects is made, because the same adjustment is made to all levels of the within-subjects factor(s) unless the covariate varies with the level of the within-subjects factor(s). If an adjustment is desired for the within-subjects factor, then the multivariate approach to the analysis of RM is needed (Ceurvorst & Stock, 1978). Delaney and Maxwell (1981) point out, however, that the covariate must be adjusted by the covariate grand mean for the multivariate test to be meaningful. In further clarification of this point, Algina (1982) argued that the mean-adjusted covariate is needed only when the covariate is a fixed factor, which is generally not the case, and that the meaningfulness of the hypothesis tests for between-subjects, within-subjects and their interaction depends on the homogeneity of the within-cell slopes.

Only one of the 11 studies using a within-subjects factor cited the Delaney and Maxwell (1981) article and used the mean-adjusted covariate. None of the articles stated that they used the multivariate approach to test the within-subjects factor. And only one study commented on the equality of the within-group regression slopes.

Twelve of the studies reviewed used a MANCOVA, and all of these studies followed a significant multivariate test with a series of univariate ANCOVA tests. Only the univariate analyses are discussed here.

Twenty-one of the studies had at least three levels of an explanatory variable, but only eight studies involved variables having more than three levels. Over half (27) of the studies did not use a contrast procedure, because either there were only two levels of the explanatory or grouping variable, or there were no differences among the levels of the grouping variable having more than two levels.
In two studies contrasts would have been appropriate but were not computed, and in one study post hoc tests were computed but not specified.

When an MCP was used, the most common (6) procedure was the multiple t-test approach using the pooled within-group variance. Five of these analyses were preceded by an omnibus F test. Two additional studies stated that they used Fisher's LSD method, but with unequal sample sizes these analyses were equivalent to multiple t tests. Of the eight studies, five examined all pairwise contrasts, two studies examined a subset of all pairwise contrasts, and one study examined a set of orthogonal contrasts. Textbooks (e.g., Keppel, 1991; Maxwell & Delaney, 1990) generally recommend a Bonferroni-adjusted multiple t-test procedure, or the Bryant and Paulson (1976) procedure if the covariate is considered a random variable, when all pairwise contrasts are of interest. None of the studies reported using a Bonferroni-adjusted significance level or the Bryant-Paulson procedure, although one study referenced Seaman, Levin, and Serlin (1994), who showed that when df ≤ 2 the familywise error (FWE) rate is controlled. Most of the studies reviewed here involved df ≤ 2, in which case an MCP to control the FWE is therefore unnecessary.

Assessment of Validity Assumptions

As we indicated previously, the deleterious effects of assumption violations are exacerbated when group sizes are unequal. The majority of the studies reviewed here involved unequal and disproportional sample sizes. Thirty-four of the studies made no comment at all regarding the sample distributions, nor did they report any attempt to determine whether it appeared reasonable that the assumptions were met. Only 8 of the studies commented on the homogeneity of regression slope assumption. Six of the studies found no evidence that the assumption was violated; one study found the slopes to differ on only one of the 17 outcomes examined and attributed the result to a Type I error. One study found the slopes to be unequal and proceeded to analyze the data using gain scores. Ignoring the assumption of equal within-group regression slopes is equivalent to assuming that there is no interaction between the covariate and the grouping variable. In factorial designs researchers rarely are willing to assume no interaction between explanatory variables without at least testing that assumption. If the regression slopes are unequal, an inappropriate adjustment is made in nonrandomized studies, and in experimental studies, at a minimum, statistical power is lost. But perhaps more importantly, the interpretation of the treatment effect is suspect when the interaction is present. Rather than ignoring the interaction hypothesis, researchers might consider analyzing the data using methods that do not assume homogeneity of regression, as suggested by Rogosa (1980).

Finally, only two studies considered normality, and four studies commented on homogeneity of variances. Only one study commented on a search for outliers.

Power/Effect Size Analysis

Surprisingly, only 15 studies reported the adjusted means (it was assumed that reported means were unadjusted unless explicitly stated), 11 studies provided some index of effect size, with the standardized mean difference being the most popular (7), and none of the studies examined reported results in terms of confidence intervals. Several authors over the past several years (e.g., Carver, 1993; Cohen, 1990; Schmidt, 1992) have recommended that, in addition to or instead of tests of statistical significance, indices of meaningfulness should also be reported. Some have even recommended abandoning the significance test in favor of effect size indicators and confidence intervals (see, for example, Harlow et al., 1997). In the present sample of studies the behavioral science researchers were either unaware of these recommendations or chose to ignore them.
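For readers who want a concrete template for this kind of supplementary reporting, the following Python sketch computes a pooled-SD standardized mean difference and a 95% confidence interval for the raw mean difference from hypothetical two-group data. It is offered only as an illustration of the recommendation above and is not drawn from any of the reviewed articles.

```python
import numpy as np
from scipy import stats

# Hypothetical posttest scores for two groups.
rng = np.random.default_rng(5)
a = rng.normal(55, 10, 40)
b = rng.normal(50, 10, 40)

# Standardized mean difference (pooled-SD version) and a conventional
# 95% confidence interval for the raw mean difference.
na, nb = len(a), len(b)
sp = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
d = (a.mean() - b.mean()) / sp

diff = a.mean() - b.mean()
se = sp * np.sqrt(1 / na + 1 / nb)
tcrit = stats.t.ppf(0.975, na + nb - 2)
lo, hi = diff - tcrit * se, diff + tcrit * se
print(f"d = {d:.2f}, mean difference = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```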
Software Packages/Citations

Only four of the studies reported the computer package used; two used SPSS and two used HLM (Bryk, Raudenbush, Seltzer, & Congdon, 1989). With a procedure like ANCOVA, where programming requires little judgment and programs basically report the same statistics, perhaps identification of the specific package is unnecessary. However, when contrasts are tested, not all programs are alike. SPSS, for example, in factorial or RM designs does not compute the Scheffé, Tukey, Bryant and Paulson (1976), or Newman-Keuls MCPs, nor is it possible to compute all possible pairwise contrasts. One wonders, then, whether these procedures were computed correctly, because some computation is required (Kirk, 1995, p. 725) to obtain the correct standard error. SAS (1990), on the other hand, does compute all pairwise contrasts, and complex contrasts can be requested; a Bonferroni adjustment can then easily be made. (SAS also does not compute the Bryant and Paulson statistic.)

Seventeen of the articles referenced statistics texts or methodological articles to support the procedures they used to analyze their data. The most frequently cited statistical reference was the textbook by Kirk (1982); it was cited three times.

Conclusions and Recommendations Concerning Covariance Designs

Our review of 45 articles reporting applications of ANCOVA demonstrates the wide applicability of this analytic technique. The technique has been used across a wide range of disciplines, a variety of age groups, and populations. Although the technique is extremely flexible in its application, the 45 studies reviewed here represent only a small percentage of the potential applications. In particular, we found only a small number of applications in experimental studies. Researchers have failed to recognize the potential benefits of reduced error variance to increase statistical power and improve precision. To the extent that our sample of ANCOVA applications is representative of analytic practice with the technique, it appears to us that most reports of the analyses are inadequate and incomplete.

Although ANCOVA is a versatile analytic tool, it can also be misunderstood, misused, and misinterpreted. Researchers appear to be unaware of, or at least fail to recognize, the assumptions that underlie the statistical models they use. The fact that most of the studies reviewed involved unequal and disproportionate group sizes further raises concern about the statistical validity of many research findings. Researchers have generally ignored interaction effects between the covariate(s) and the grouping variables and have failed to examine residual plots to identify heteroscedasticity and outliers. These preliminary analyses are necessary, but they need not require an exhaustive discussion or extended journal space.
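The preliminary checks just described can indeed be reported in a few lines. The sketch below (Python with statsmodels; hypothetical data and invented variable names) tests the homogeneity of regression slopes by adding a covariate-by-group interaction and then inspects standardized residuals from the ANCOVA model. It is one possible way to operationalize these checks, not the approach used by any of the reviewed studies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical three-group ANCOVA data with an invented pretest covariate.
rng = np.random.default_rng(6)
n = 90
data = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], n // 3),
    "pretest": rng.normal(50, 10, n),
})
data["posttest"] = (10 + 0.8 * data["pretest"]
                    + data["group"].map({"A": 0, "B": 3, "C": 5})
                    + rng.normal(0, 5, n))

# Homogeneity of regression slopes: the covariate-by-group interaction term.
# A clearly non-null interaction row signals that a single covariate-adjusted
# group effect is not interpretable.
slopes = smf.ols("posttest ~ pretest * C(group)", data=data).fit()
print(anova_lm(slopes, typ=2))

# Residual diagnostics for the ANCOVA model proper: large standardized
# residuals flag outliers; residuals that fan out against fitted values
# suggest heteroscedasticity.
ancova = smf.ols("posttest ~ pretest + C(group)", data=data).fit()
std_resid = ancova.resid / np.sqrt(ancova.mse_resid)
print("largest |standardized residual|:", round(float(np.abs(std_resid).max()), 2))
```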
When heterogeneity of regression exists, researchers should consider adopting the method presented by Rogosa (1980); hopefully, more robust methods will become available for application to ANCOVA problems. A brief paragraph outlining the procedures used to examine the sample data and a summary of the findings would substantially enhance the credibility of the data analysis. McAuliffe and Dembo (1994) demonstrated how these preliminary analyses may be succinctly reported.

Researchers frequently do not provide adequate descriptive statistics, including sample sizes at the smallest group level, pretest means, standard deviations, and adjusted posttest means. A summary table presenting differences among four groups reported by Steinberg, Lamborn, Darling, Mounts and Dornbusch (1994) is a nice example of how these data might be reported. Researchers have continued to overrely on hypothesis tests, reporting F ratios and p values. Effect sizes and confidence intervals have been widely recommended but generally ignored by data analysts. Two exceptions are Simpson, Olejnik, Tam, and Supattathum (1994), who reported standardized mean differences, and Seidman, Allen, Aber, Mitchell and Feinman (1994), who used eta-squared to further explicate their results.

Summing Up and General Recommendations

Based on our surveys we have made specific recommendations to researchers concerning how to improve statistical analysis practice. In the space remaining we will punctuate our literature reviews of data-analytic practices with: (1) comments directed at several general themes and principles evident in them; and (2) further observations and recommendations related to improving the statistical analyses and reports of behavioral science research data.

General Themes and Principles

Of the several common themes and principles identified in the present set of reviews, three pertain specifically to "assumption validity" concerns. These may be summarized as dos and don'ts for researchers in the following manner:

1. Be wary. Behavioral science researchers should not automatically conduct a "standard" analysis. Times change, as related both to: (1) the advent of newer, more robust, analytic solutions to assumption-violated data; and to: (2) what is known about the distributional conditions under which a specific statistical test may or may not be appropriate. Conscientious researchers should work hard to be apprised of both those newer developments and those differing conditions (see Wilcox, 1998). Indeed, the textbook procedures of the '50s and '60s (e.g., conventional univariate F tests for the analysis of repeated measurements) have been replaced by more sophisticated analyses (e.g., the EB and mixed-model approaches to the analysis of repeated measurements), and reliance on the older methods may lead to misleading or erroneous conclusions.

2. Be more intimate with your data. First, researchers need to: (a) have a clear understanding of the statistical model that underlies their analyses, (b) conduct a careful preliminary analysis of their data, and (c) provide a detailed report of their analytic results. Unfortunately, many of the articles we reviewed lacked one or more of these conditions. With reference to point (b), researchers need to be more proactive in identifying potential distributional abnormalities in their data by not relying exclusively on summary statistics (e.g., sample means, standard deviations, correlations).
Rather, attempts should be made to delve further into one's data [e.g., Exploratory Data Analysis (EDA) techniques such as graphs can be examined, including boxplots, normal probability plots, etc.; see Behrens (1997) for a discussion of EDA]. For data that appear to conform to distributional assumptions, proceed in textbook fashion; but with nonconforming data, give serious consideration to more appropriate alternative analysis procedures of the kind indicated in the present review. In accomplishing this goal, researchers should identify the statistical software package (particular programs or procedures) that was used to obtain numerical results (by year or version or release). Numerical results for many of the analyses recommended in our article can be obtained either entirely or in part from the major statistical software packages (e.g., the general linear model and/or multivariate programs from SAS, SPSS, SYSTAT).

3. Don't expect one size to fit all. Each new set of data contains its own distributional idiosyncrasies, and different analytic tools are required for different types of data. Fortunately, and as was noted previously, both new developments in the statistical literature and the associated computer software are proceeding apace. In fact, it could reasonably be argued (on the basis of Type I error and power characteristic studies) that if a "single size" were to "fit all," that single size should be from the class of lesser-known Welch-based ANOVA alternatives, rather than the standard F test itself (see also Lix & Keselman, 1995). Similarly, and perhaps surprisingly to many educational researchers, in the context of RM designs the standard textbook-recommended univariate F test is a disastrous single size to consider!

Several other data-analysis themes were presented in our article as well. These include the following:

• Researchers should pay greater attention to the "substantive" significance (e.g., Robinson & Levin, 1997; Thompson, 1996) of their research findings, as reflected by various effect size and strength-of-relationship measures, rather than simply to the "statistical" significance of their findings. We similarly believe that confidence interval estimates should receive greater use (see, e.g., Harlow et al., 1997).

• Researchers should regularly concern themselves with the statistical power characteristics of their studies. Even better, from our perspective, researchers should plan their studies (in terms of appropriate sample sizes) so as to have sufficient power to detect effects that are deemed to be of substantive importance (e.g., Cohen, 1988; Levin, 1997); a minimal sketch of such a calculation appears after this list.

• Researchers should similarly think about their specific research questions prior to conducting their studies so that they can select the most appropriate and powerful analytic techniques by which to analyze their data (e.g., Levin, 1998; Marascuilo & Levin, 1970). Omnibus hypothesis tests should not be routinely conducted when individual contrasts form the basis of the researcher's major questions of interest. Multivariate tests should generally be reserved for questions about multivariate structure. Thus, researchers need to translate their research questions into specific and detailed statistical hypotheses.

• Researchers should avoid making "logical inconsistency" errors (or what have been called Type IV errors; see Marascuilo & Levin, 1970) in their analyses. Incorrect interpretations of rejected interaction hypotheses constitute but one salient example of this type of problem that was encountered in the surveyed studies.
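As the minimal sketch promised in the second bullet above, the following Python snippet uses statsmodels' FTestAnovaPower to solve for the total sample size in a hypothetical four-group, one-way design. The effect size argument is Cohen's f, and the target values (f = .25, power = .80, alpha = .05) are illustrative assumptions rather than recommendations for any particular study.

```python
from statsmodels.stats.power import FTestAnovaPower

# Planning sample size for a one-way, four-group ANOVA: how many participants
# in total are needed to detect a "medium" effect (Cohen's f = 0.25) with
# power .80 at alpha = .05?
analysis = FTestAnovaPower()
n_total = analysis.solve_power(effect_size=0.25, k_groups=4,
                               alpha=0.05, power=0.80)
print(f"total N required: {n_total:.0f}")
```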
We, of course, know that there are a number of practical issues that affect research practice, and in particular, the manner in which data are analyzed. These include: (1) the (limited) training that occurs in graduate-level statistical methods courses; (2) the views of journal editors regarding the types of analyses that they believe are appropriate; (3) the restricted ability of researchers to hire statistical consultants on their projects; (4) the inaccessibility and/or complexity of statistical software; and (5) the cultural milieu within the present-day educational research community. We consider each of these obstacles in turn.

First, with regard to graduate-level training, we have noted that quantitative methods courses have diminished in the set of students' required/recommended courses in many of our graduate programs. When such courses are included in the curriculum, they are frequently taught by colleagues whose speciality is not quantitative methods. We consider such circumstances to be unfortunate and disadvantageous to students whose careers will involve either conducting empirical research or consuming empirical research findings.

Second, editors of professional journals obviously have their own biases regarding the "proper" analyses that should accompany research reports, as well as the ones they would prefer to see in their own journals. We can only hope that editors, in addition to the researchers themselves, will take notice of the points raised in our review.

Third, in this era of dwindling financial resources for educational research, possibilities for allocating funds for statistical consultation have similarly dwindled. Nonetheless, with whatever funds are available, educational researchers should consider adopting the medical model, where having a statistical consultant on board is common research practice.

Fourth, in reference to the inaccessibility/complexity of noncommercially produced software of the kind recommended in this article (e.g., Lix and Keselman's 1995 SAS/IML program for obtaining robust analyses, particularly in repeated measures designs), we note that such programs are becoming more accessible, almost on a daily basis, through the internet and its downloading facility. Some might argue that using such programs is beyond the capability of researchers who are not quantitative experts. We, on the other hand, do not subscribe to that position but instead maintain a "let's see" attitude. Frankly, we do not think our profession is well served if the newest developments in an area are hidden simply because of the fear that colleagues might find those developments challenging.

Finally, what about the cultural milieu in educational research today? The appropriate-statistical-analysis message delivered in this article might seem like very "small potatoes" indeed in a field that is currently struggling with more overreaching philosophical issues, such as the role and importance of quantitative methods in educational research. Why then are we concerned at all about what seem to be much less important and esoteric matters, such as distributional assumptions and the validity of statistical tests? Obviously, one isolated article, with its restricted focus, cannot resolve the quantitative-qualitative debate. In fact, the present article was not intended even to discuss it.
Our purpose here was to argue that if and when inferential statistical methods are the analytic tools of choice, then at least those tools should be used wisely and properly, in a "statistically valid" (Cook & Campbell, 1979) way. Improper use is likely to lead to danger, in the form of researcher conclusions that are unwarranted on the basis of the evidence presented and analyzed. Consequently, our plea to educational researchers is twofold: (a) be more concerned about mismatches between your evidence and the conclusions you reach; and (b) seek out and embrace statistical methods that are known to reduce that mismatch.

In conclusion, this review should serve as a wake-up call to substantive and quantitative researchers alike. Substantive researchers need to wake up both to the (inappropriate) statistical techniques that are currently being used in practice and to the (more appropriate) ones that should be being used. Quantitative researchers need to wake up to the needs of substantive researchers. If the best statistical developments and recommendations are to be incorporated into practice, it is critical that quantitative researchers broaden their dissemination base and publish their findings in applied journals in a fashion that is readily understandable to the applied researcher.

Footnotes

This research was supported in part by a grant from the Social Sciences and Humanities Research Council of Canada. Authorship is listed alphabetically within each tier.

1. The content analyses that follow were originally presented as a symposium at the 1996 annual meeting of The Psychometric Society in Banff, Canada. The titles and authors of those papers were: (1) The analysis of between-subjects univariate designs (Lix, Cribbie, & Keselman, 1996), (2) The analysis of between-subjects multivariate designs (Huberty & Lowman, 1996), (3) The analysis of repeated measures designs (Kowalchuk, Lix, & Keselman, 1996), and (4) The analysis of covariance designs (Olejnik & Donahue, 1996). The symposium concluded with a discussion by Joanne C. Keselman and Joel R. Levin.

References

Algina, J. (1982). Remarks on the analysis of covariance in repeated measures designs. Multivariate Behavioral Research, 17, 117-130.
Algina, J. (1994). Some alternative approximate tests for a split plot design. Multivariate Behavioral Research, 29, 365-384.
Algina, J., & Oshima, T. C. (1994). Type I error rates for Huynh's general approximation and improved general approximation tests. British Journal of Mathematical and Statistical Psychology, 47, 151-165.
American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: American Psychological Association.
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.
Boik, R. J. (1988). Scheffé's mixed model for multivariate repeated measures: A relative efficiency evaluation. Communications in Statistics: Theory and Methods, 20, 1233-1255.
Boik, R. J. (1993). The analysis of two-factor interactions in fixed effects linear models. Journal of Educational Statistics, 18, 1-40.
Boik, R. J. (1997). Analysis of repeated measures under second-stage sphericity: An empirical Bayes approach. Journal of Educational and Behavioral Statistics, 22, 155-192.
Bryant, J. L., & Paulson, A. S. (1976). An extension of Tukey's method of multiple comparisons to experimental designs with random concomitant variables. Biometrika, 63, 631-638.
Bryk, A., Raudenbush, S., Seltzer, M., & Congdon, R., Jr. (1989). An introduction to HLM: Computer program and users' guide. Chicago: University of Chicago Press.
Carlson, J. E., & Timm, N. H. (1974). Analyses of nonorthogonal fixed effects designs. Psychological Bulletin, 81, 563-570.
Carver, R. P. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287-292.
Ceurvorst, R. W., & Stock, W. A. (1978). Comments on the analysis of covariance with repeated measures designs. Multivariate Behavioral Research, 13, 509-513.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in Statistics - Simulation, 26, 1251-1273.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.
Coombs, W. T., & Algina, J. (in press). New test statistics for MANOVA/descriptive discriminant analysis. Educational and Psychological Measurement.
Coombs, W. T., & Algina, J. (1996). On sample size requirements for Johansen's test. Journal of Educational and Behavioral Statistics, 21, 169-178.
Coombs, W. T., Algina, J., & Oltman, D. O. (1996). Univariate and multivariate omnibus hypothesis tests selected to control Type I error rates when population variances are not necessarily equal. Review of Educational Research, 66, 137-179.
Coursol, A., & Wagner, E. E. (1986). Affect of positive findings on submission and acceptance rates: A note on meta-analysis bias. Professional Psychology, 17, 136-137.
Davidson, M. L. (1972). Univariate versus multivariate tests in repeated measures experiments. Psychological Bulletin, 77, 446-452.
Delaney, H. D., & Maxwell, S. E. (1981). On using analysis of covariance in repeated measures design. Multivariate Behavioral Research, 16, 105-124.
Edgington, E. S. (1964). A tabulation of inferential statistics used in psychology journals. American Psychologist, 19, 202-203.
Ekstrom, D., Quade, D., & Golden, R. N. (1990). Statistical analysis of repeated measures in psychiatric research. Archives of General Psychiatry, 47, 770-772.
Elmore, P. B., & Woehlke, P. L. (1988). Statistical methods employed in American Educational Research Journal, Educational Researcher, and Review of Educational Research from 1978 to 1987. Educational Researcher, 17(9), 19-20.
Elmore, P. B., & Woehlke, P. L. (1998, April). Twenty years of research methods employed in American Educational Research Journal, Educational Researcher, and Review of Educational Research. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Games, P. A. (1980). Alternative analyses of repeated-measure designs by ANOVA and MANOVA. In A. Von Eye (Ed.), Statistical methods in longitudinal research: Vol. 1. Principles and structuring change (pp. 81-121). San Diego, CA: Academic Press.
Glass, G. V, Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237-288.
Goodwin, L. D., & Goodwin, W. L. (1985a). An analysis of statistical techniques used in the Journal of Educational Psychology, 1979-1983. Educational Psychologist, 20, 13-21.
Goodwin, L. D., & Goodwin, W. L. (1985b). Statistical techniques in AERJ articles, 1979-1983: The preparation of graduate students to read the educational research literature. Educational Researcher, 14(2), 5-11.
Hamilton, B. L. (1977). An empirical investigation of the effects of heterogeneous regression slopes in analysis of covariance. Educational and Psychological Measurement, 37, 701-712.
Harlow, L. L., Muliak, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Hillsdale, NJ: Erlbaum.
Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-339.
Hsiung, T., & Olejnik, S. (1994a). Contrast analyses for additive nonorthogonal two-factor designs in unequal variance cases. British Journal of Mathematical and Statistical Psychology, 47, 337-354.
Huberty, C. J (1994). Applied discriminant analysis. New York: Wiley.
Huberty, C. J (1996, August). Some issues and problems in discriminant analysis. Paper presented at the Joint Statistical Meetings, Chicago.
Huberty, C. J, & Lowman, L. L. (1996, June). The analysis of multivariate designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Canada.
Huberty, C. J, & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105, 302-308.
Huberty, C. J, & Petoskey, M. D. (in press). Multivariate analysis of variance and covariance. In H. E. A. Tinsley & S. Brown (Eds.), Handbook of multivariate statistics and mathematical modeling. San Diego, CA: Academic Press.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the ratios of the population variances are unknown. Biometrika, 41, 19-43.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Keselman, H. J. (1982). Multiple comparisons for repeated measures means. Multivariate Behavioral Research, 17, 87-92.
Keselman, H. J. (1994). Stepwise and simultaneous multiple comparison procedures of repeated measures' means. Journal of Educational Statistics, 19, 127-162.
Keselman, H. J., & Algina, J. (1997). The analysis of higher-order repeated measures designs. In B. Thompson (Ed.), Advances in social science methodology. Greenwich, CT: JAI Press.
Keselman, H. J., Algina, J., Kowalchuk, R. K., & Wolfinger, R. D. (in press). A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics: Simulation and Computation.
Keselman, H. J., Algina, J., Kowalchuk, R. K., & Wolfinger, R. D. (1997). The analysis of repeated measurements with mixed-model Satterthwaite F tests. Unpublished manuscript.
Keselman, H. J., Carriere, K. C., & Lix, L. M. (1993). Testing repeated measures hypotheses when covariance matrices are heterogeneous. Journal of Educational Statistics, 18, 305-319.
Keselman, H. J., Carriere, K. C., & Lix, L. M. (1995). Robust and powerful nonorthogonal analyses. Psychometrika, 60, 395-418.
Keselman, H. J., & Keselman, J. C. (1993). Analysis of repeated measurements. In L. K. Edwards (Ed.), Applied analysis of variance in behavioral science (pp. 105-145). New York: Marcel Dekker.
Keselman, H. J., Keselman, J. C., & Games, P. A. (1991). Maximum familywise Type I error rate: The least significant difference, Newman-Keuls, and other multiple comparison procedures. Psychological Bulletin, 110, 155-161.
Keselman, H. J., Keselman, J. C., & Shaffer, J. P. (1991). Multiple pairwise comparisons of repeated measures means under violation of multisample sphericity. Psychological Bulletin, 110, 162-170.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Keselman, H. J., & Lix, L. M. (1997). Analyzing multivariate repeated measures designs when covariance matrices are heterogeneous. British Journal of Mathematical and Statistical Psychology, 50, 319-338.
Keselman, H. J., Lix, L. M., & Kowalchuk, R. K. (1997). Multiple comparison procedures for trimmed means. Psychological Methods, 3, 123-141.
Keselman, H. J., Rogan, J. C., Mendoza, J. L., & Breen, L. J. (1980). Testing the validity conditions of repeated measures F tests. Psychological Bulletin, 87, 479-481.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.
Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Belmont, CA: Brooks/Cole.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Kowalchuk, R. K., Lix, L. M., & Keselman, H. J. (1996, June). The analysis of repeated measures designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Canada.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics. Stanford, CA: Stanford University Press.
Levin, J. R. (1997). Overcoming feelings of powerlessness in "aging" researchers: A primer on statistical power in analysis of variance designs. Psychology and Aging, 12, 84-106.
Levin, J. R. (1998). To test or not to test H0? Educational and Psychological Measurement, 58, 313-333.
Levin, J. R., Serlin, R. C., & Seaman, M. A. (1994). A controlled, powerful multiple-comparison strategy for several situations. Psychological Bulletin, 115, 153-159.
Levy, K. J. (1980). A Monte Carlo study of analysis of covariance under violations of the assumptions of normality and equal regression slopes. Educational and Psychological Measurement, 40, 835-840.
Lix, L. M., Cribbie, R., & Keselman, H. J. (1996, June). The analysis of between-subjects univariate designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Canada.
Lix, L. M., & Keselman, H. J. (1995). Approximate degrees of freedom tests: A unified perspective on testing for mean equality. Psychological Bulletin, 117, 547-560.
Lix, L. M., & Keselman, H. J. (1996). Interaction contrasts in repeated measures designs. British Journal of Mathematical and Statistical Psychology, 49, 147-162.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of location equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58, 409-429.
Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66, 579-620.
Marascuilo, L. A., & Levin, J. R. (1970). Appropriate post hoc comparisons for interaction and nested hypotheses in analysis of variance designs: The elimination of Type IV errors. American Educational Research Journal, 7, 397-421.
Maxwell, S. E. (1980). Pairwise multiple comparisons in repeated measures designs. Journal of Educational Statistics, 5, 269-287.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA: Wadsworth.
Maxwell, S. E., O'Callaghan, M. F., & Delaney, H. D. (1993). Analysis of covariance. In L. K. Edwards (Ed.), Applied analysis of variance in behavioral science. New York: Marcel Dekker.
Mauchly, J. W. (1940). Significance test for sphericity of a normal n-variate distribution. Annals of Mathematical Statistics, 29, 204-209.
McAuliffe, T. J., & Dembo, M. H. (1994). Status rules of behavior in scenarios of peer learning. Journal of Educational Psychology, 86(2), 163-172.
McCall, R. B., & Appelbaum, M. I. (1973). Bias in the analysis of repeated-measures designs: Some alternative approaches. Child Development, 44, 401-415.
Mendoza, J. L. (1980). A significance test for multisample sphericity. Psychometrika, 45, 495-498.
Milligan, G. W., Wong, D. S., & Thompson, P. A. (1987). Robustness properties of nonorthogonal analysis of variance. Psychological Bulletin, 101, 464-470.
Norusis, M. J. (1993). SPSS for Windows: Advanced statistics, release 6. Chicago, IL: SPSS Inc.
O'Brien, R. G., & Kaiser, M. K. (1985). MANOVA method for analyzing repeated measures designs: An extensive primer. Psychological Bulletin, 97, 316-333.
O'Brien, R. G., & Muller, K. E. (1993). Unified power analysis for t-tests through multivariate hypotheses. In L. K. Edwards (Ed.), Applied analysis of variance in behavioral science (pp. 297-344). New York: Marcel Dekker.
Olejnik, S., & Donahue, B. (1996, June). The analysis of covariance designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Canada.
Olejnik, S., & Hess, B. (1997). Top ten reasons why most omnibus ANOVA F-tests should be abandoned. Journal of Vocational Education Research, 22, 219-232.
Olejnik, S., & Huberty, C. J (1993, April). Preliminary statistical tests. Paper presented at the annual meeting of the American Educational Research Association, Atlanta.
Olejnik, S., & Lee, J. L. (1990). Multiple comparison procedures when population variances differ. University of Georgia. (ERIC Document Reproduction Service No. ED 319 754)
Orr, J. M., Sackett, P. R., & Dubois, C. L. Z. (1991). Outlier detection and treatment in I/O psychology: A survey of researcher beliefs and an empirical illustration. Personnel Psychology, 44, 473-486.
Ridgeway, V. G., Dunston, P. J., & Qian, G. (1993). A methodological analysis of teaching and learning strategy research at the secondary school level. Reading Research Quarterly, 28, 335-349.
Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26, 21-26.
Rogosa, D. (1980). Comparing non-parallel regression lines. Psychological Bulletin, 88, 307-321.
Romaniuk, J. G., Levin, J. R., & Hubert, L. J. (1977). Hypothesis-testing procedures in repeated measures designs: On the road map not taken. Child Development, 48, 1757-1760.
Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537-560.
SAS. (1990). SAS/STAT user's guide. Cary, NC: Author.
Schmidt, F. L. (1992). What do data really mean? American Psychologist, 47, 1173-1181.
Seaman, M. A., Levin, J. R., & Serlin, R. C. (1994). A controlled, powerful multiple-comparison strategy for several situations. Psychological Bulletin, 115, 153-159.
Seidman, E., Allen, L., Aber, J. L., Mitchell, C., & Feinman, J. (1994). The impact of school transitions in early adolescence on the self-esteem and perceived social context of poor urban youth. Child Development, 65, 507-522.
Simpson, M. L., Olejnik, S., Tam, A. Y., & Supattathum, S. (1994). Elaborative verbal rehearsals and college students' cognitive performance. Journal of Educational Psychology, 86(2), 267-278.
Steinberg, L., Lamborn, S. D., Darling, N., Mounts, N. S., & Dornbusch, S. M. (1994). Over-time changes in adjustment and competence among adolescents from authoritative, authoritarian, indulgent, and neglectful families. Child Development, 65, 754-770.
Stevens, J. (1996). Applied multivariate statistics for the social sciences. Mahwah, NJ: Erlbaum.
Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: Harper Collins.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25, 26-30.
West, C. K., Carmody, C., & Stallings, W. M. (1983). The quality of research articles in the Journal of Educational Research, 1970 and 1980. Journal of Educational Research, 77, 70-76.
Wilcox, R. R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29-60.
Wilcox, R. R. (1993). Analysing repeated measures or randomized block designs using trimmed means. British Journal of Mathematical and Statistical Psychology, 46, 63-76.
Wilcox, R. R. (1995). ANOVA: A paradigm for low power and misleading measures of effect size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1996). Statistics for the social sciences. New York: Academic Press.
Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 300-314.
Wilcox, R. R., Charlin, V. L., & Thompson, K. L. (1986). New Monte Carlo results on the robustness of the ANOVA F, W, and F* statistics. Communications in Statistics: Simulation and Computation, 15, 933-943.
Wilkinson, L. (1988). SYSTAT: The system for statistics. Evanston, IL: SYSTAT Inc.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design. New York: McGraw-Hill.

Table 1. Journal Source and Frequency for the Content Analyses

Journal (frequencies by design: BSUD, BSMD, RMD, CD)
American Educational Research Journal: 4, 4, 5
Child Development: 16, 34, 56, 10
Cognition and Instruction: 3, 5, 1
Contemporary Educational Psychology: 5, 19, 3
Developmental Psychology: 7, 12, 52, 5
Educational Technology, Research and Development: 1, 1
Journal of Applied Psychology: 10
Journal for Research in Mathematics Education: 3
Journal of Counseling Psychology: 3, 10, 10, 2
Journal of Educational Computing Research: 10, 17, 6
Journal of Educational Psychology: 6, 9, 20, 7
Journal of Experimental Child Psychology: 5, 33, 1
Journal of Experimental Education: 3
Journal of Personality and Social Psychology: 6
Journal of Reading Behavior: 3
Reading Research Quarterly: 1
Sociology of Education: 1, 2
TOTAL: 61, 79, 226, 45

Note: BSUD = Between-Subjects Univariate Design; BSMD = Between-Subjects Multivariate Design; RMD = Repeated Measures Design; CD = Covariance Design.

Table 2. Between-Subjects Univariate Design and Methods of Analysis
